
    An Information Theoretic approach to Post Randomization Methods under Differential Privacy

    Post Randomization Methods (PRAM) are among the most popular disclosure limitation techniques for both categorical and continuous data. In the categorical case, given a stochastic matrix M and a specified variable, an individual belonging to category i is changed to category j with probability M_{i,j}. Every approach to choosing the randomization matrix M must balance two desiderata: 1) preserving as much statistical information from the raw data as possible; 2) guaranteeing the privacy of individuals in the dataset. This trade-off has generally proven very challenging to resolve. In this work, we use recent tools from the computer science literature and choose M as the solution of a constrained maximization problem: we maximize the Mutual Information between raw and transformed data, subject to the constraint that the transformation satisfies Differential Privacy. For the general categorical model, we show that this maximization problem reduces to a linear program and can therefore be solved with known optimization algorithms.
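    As a concrete illustration of the mechanism described above, the sketch below applies a PRAM channel and checks the pure differential-privacy constraint on its rows; the matrix M, the toy data, and the privacy budget are illustrative assumptions, not values from the paper.

```python
import numpy as np

def pram(categories, M, rng=None):
    """Post Randomization: a record in category i moves to j with probability M[i, j]."""
    rng = np.random.default_rng(rng)
    k = M.shape[0]
    return np.array([rng.choice(k, p=M[i]) for i in categories])

def satisfies_dp(M, eps):
    """Check the differential-privacy constraint on the channel:
    M[i, j] <= exp(eps) * M[i2, j] for every pair of input categories i, i2."""
    bound = np.exp(eps)
    return all(
        M[i, j] <= bound * M[i2, j] + 1e-12
        for j in range(M.shape[1])
        for i in range(M.shape[0])
        for i2 in range(M.shape[0])
    )

# Illustrative 3-category channel (rows sum to 1).
M = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])
data = np.array([0, 0, 1, 2, 2, 2])
print(pram(data, M, rng=0))
print(satisfies_dp(M, eps=np.log(3)))  # max row ratio is 0.6/0.2 = 3 <= e^eps
```

    The `satisfies_dp` check is exactly the feasible region of the linear program mentioned in the abstract: the DP constraints are linear in the entries of M.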

    Survival analysis via hierarchically dependent mixture hazards

    Hierarchical nonparametric processes are popular tools for defining priors on collections of probability distributions, inducing dependence across multiple samples. In survival analysis, however, one is typically interested in modeling the hazard rates rather than the probability distributions themselves, and the currently available methodologies do not apply. Here, we fill this gap by introducing a novel, analytically tractable class of multivariate mixtures whose distribution acts as a prior for the vector of sample-specific baseline hazard rates. Dependence is induced through a hierarchical specification of the mixing random measures, which ultimately corresponds to a composition of random discrete combinatorial structures. Our theoretical results allow us to develop a full Bayesian analysis for this class of models, which can also account for right-censored survival data and covariates, and we establish posterior consistency. In particular, the posterior characterization we achieve is the key to devising both marginal and conditional algorithms for evaluating Bayesian inferences of interest. The effectiveness of our proposal is illustrated through synthetic and real data examples.
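    To make the mixture-hazard construction concrete, here is a minimal sketch that evaluates a hazard rate induced by a discrete mixing measure, h(t) = sum_j w_j * k(t; theta_j); the box kernel, weights, and atoms are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def mixture_hazard(t, weights, atoms, kernel):
    """Evaluate h(t) = sum_j w_j * k(t; theta_j) at each time point in t."""
    t = np.atleast_1d(t)
    return np.array([sum(w * kernel(ti, th) for w, th in zip(weights, atoms))
                     for ti in t])

# Example kernel: the uniform "box" kernel k(t; theta) = 1{0 <= t <= theta} / theta,
# a standard choice in kernel mixture models for hazards.
box = lambda t, theta: (0 <= t <= theta) / theta

w = [0.5, 0.3, 0.2]   # jump sizes of the discrete mixing measure (illustrative)
th = [1.0, 2.0, 4.0]  # atom locations (illustrative)
print(mixture_hazard([0.5, 1.5, 3.0], w, th, box))  # [0.7, 0.2, 0.05]
```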

    More for less: Predicting and maximizing genetic variant discovery via Bayesian nonparametrics

    While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains non-trivial. Under a fixed budget, then, scientists face a natural trade-off between quantity and quality: they can spend resources to sequence a greater number of genomes (quantity) or spend resources to sequence genomes with increased accuracy (quality). Our goal is to find the optimal allocation of resources between quantity and quality. Optimizing resource allocation promises to reveal as many new variations in the genome as possible, and thus as many new scientific insights as possible. In this paper, we consider the common setting where scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. We introduce a Bayesian nonparametric methodology to predict the number of new variants in the follow-up study based on the pilot study. When experimental conditions are kept constant between the pilot and follow-up, we demonstrate on real data from the gnomAD project that our prediction is more accurate than three recent proposals, and competitive with a more classic proposal. Unlike existing methods, though, our method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for (i) more realistic predictions and (ii) optimal allocation of a fixed budget between quality and quantity.
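    For context, a classic frequency-based predictor in this literature is the Good-Toulmin extrapolation, which estimates new discoveries from the pilot's frequency counts; the sketch below implements it on toy data and is not the paper's Bayesian nonparametric method.

```python
from collections import Counter

def good_toulmin(counts, t):
    """Classic Good-Toulmin extrapolation (a baseline, NOT the paper's method).
    Given per-variant counts from a pilot of n samples, predict the number of
    *new* variants revealed by t*n additional samples (reliable for t <= 1)."""
    sf = Counter(counts.values())  # sf[k] = number of variants seen exactly k times
    return sum((-1) ** (k + 1) * t ** k * sk for k, sk in sf.items())

# Toy pilot data: hypothetical variant labels and their observed counts.
pilot = {"v1": 3, "v2": 1, "v3": 1, "v4": 2, "v5": 1}
print(good_toulmin(pilot, t=1.0))  # 3.0
```

    The estimator alternates signs over the frequency counts, which is why it degrades for extrapolation factors t much larger than 1; the paper's setting of changing experimental conditions is outside its scope.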

    Nonparametric Bayesian multi-armed bandits for single cell experiment design

    The problem of maximizing cell type discovery under budget constraints is a fundamental challenge for the collection and analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper, we introduce a simple, computationally efficient, and scalable Bayesian nonparametric sequential approach to optimize the budget allocation when designing a large-scale experiment for the collection of scRNA-seq data for the purpose of, but not limited to, creating cell atlases. Our approach relies on the following tools: i) a hierarchical Pitman-Yor prior that recapitulates biological assumptions regarding cellular differentiation, and ii) a Thompson sampling multi-armed bandit strategy that balances exploitation and exploration to prioritize experiments across a sequence of trials. Posterior inference is performed using a sequential Monte Carlo approach, which allows us to fully exploit the sequential nature of our species sampling problem. We empirically show that our approach outperforms state-of-the-art methods and achieves near-oracle performance on simulated and scRNA-seq data alike. HPY-TS code is available at https://github.com/fedfer/HPYsinglecell.
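    To illustrate the exploitation/exploration loop that Thompson sampling provides, here is a generic Beta-Bernoulli sketch; the paper's HPY-TS instead uses a hierarchical Pitman-Yor posterior updated by sequential Monte Carlo, so the Beta priors and reward functions below are illustrative assumptions only.

```python
import numpy as np

def thompson_sampling(reward_fns, n_rounds, rng=None):
    """Generic Beta-Bernoulli Thompson sampling over len(reward_fns) arms."""
    rng = np.random.default_rng(rng)
    k = len(reward_fns)
    alpha, beta = np.ones(k), np.ones(k)   # Beta(1, 1) prior per arm
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        arm = int(np.argmax(rng.beta(alpha, beta)))  # sample beliefs, pick best draw
        r = reward_fns[arm](rng)                     # observe a 0/1 reward
        alpha[arm] += r
        beta[arm] += 1 - r
        pulls[arm] += 1
    return pulls

# Toy experiment: arm 1 yields a new discovery more often than arm 0.
arms = [lambda rng: rng.random() < 0.2, lambda rng: rng.random() < 0.6]
print(thompson_sampling(arms, n_rounds=500, rng=0))
```

    In the experiment-design setting of the abstract, each "arm" would correspond to a candidate allocation of sequencing budget, and the reward to discovering a previously unseen cell type.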

    Mixture modeling via vectors of normalized independent finite point processes

    Statistical modeling in the presence of hierarchical data is a crucial task in Bayesian statistics. The Hierarchical Dirichlet Process (HDP) is the foremost tool for handling data organized in groups through mixture modeling. Although the HDP is mathematically tractable, its computational cost is typically demanding, and its analytical complexity represents a barrier for practitioners. This paper conceives a mixture model based on a novel family of Bayesian priors designed for multilevel data and obtained by normalizing a finite point process. A full distribution theory for this new family and the induced clustering is developed, including tractable expressions for marginal, posterior, and predictive distributions. Efficient marginal and conditional Gibbs samplers are designed to provide posterior inference. The proposed mixture model outperforms the HDP in terms of analytical feasibility, clustering discovery, and computational time. The motivating application comes from the analysis of shot put data, which contains performance measurements of athletes across different seasons. In this setting, the proposed model induces a clustering of the observations across seasons and athletes. By linking clusters across seasons, similarities and differences in athletes' performances are identified.
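    As a hedged sketch of the normalization idea, the code below builds a random probability measure by normalizing K independent Gamma jumps and counts the clusters it induces in a sample; the Gamma jump distribution and the choice of K are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def normalized_finite_pp(K, shape, rng=None):
    """Draw a discrete random probability measure: K atoms with independent
    Gamma jump sizes, normalized into mixture weights."""
    rng = np.random.default_rng(rng)
    jumps = rng.gamma(shape, size=K)   # independent unnormalized jumps
    atoms = rng.normal(size=K)         # atom locations from a base measure
    return jumps / jumps.sum(), atoms

def sample_clusters(n, weights, rng=None):
    """Sample n observations from the weights and count occupied clusters."""
    rng = np.random.default_rng(rng)
    labels = rng.choice(len(weights), size=n, p=weights)
    return len(np.unique(labels))

w, atoms = normalized_finite_pp(K=10, shape=0.5, rng=1)
print(sample_clusters(100, w, rng=2))
```

    Discreteness of the normalized measure is what induces ties, and hence clustering, among observations; the hierarchical version in the paper shares atoms across groups so that clusters can be linked across seasons and athletes.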